Introduction to Statistics

Summarizing data.
Plotting data.
Confidence intervals.
Statistical tests.

About this Notebook

In this notebook, we download a dataset with data about customers. Then, we calculate statistical measures and plot distributions. Finally, we perform statistical tests.



In [1]:

    
# Run this cell :)
1+2









    Out[1]:





3

Importing Needed packages

Statsmodels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.



In [3]:

    
# Uncomment next command if you need to install a missing module
#!pip install statsmodels
import matplotlib.pyplot as plt
import pandas as pd
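try:
    import statsmodels.api as sm
except ImportError:
    # Install statsmodels if it is missing, then retry the import
    !pip install statsmodels
    import statsmodels.api as sm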
import numpy as np
%matplotlib inline









    



Collecting statsmodels
  Downloading statsmodels-0.6.1.tar.gz (7.0MB)
    100% |████████████████████████████████| 7.0MB 117kB/s 
Requirement already satisfied: pandas in /usr/local/lib/python2.7/dist-packages (from statsmodels)
Collecting patsy (from statsmodels)
  Downloading patsy-0.4.1-py2.py3-none-any.whl (233kB)
    100% |████████████████████████████████| 235kB 3.4MB/s 
Requirement already satisfied: pytz>=2011k in /usr/local/lib/python2.7/dist-packages (from pandas->statsmodels)
Requirement already satisfied: python-dateutil in /usr/local/lib/python2.7/dist-packages (from pandas->statsmodels)
Requirement already satisfied: numpy>=1.7.0 in /usr/local/lib/python2.7/dist-packages (from pandas->statsmodels)
Requirement already satisfied: six in /usr/local/lib/python2.7/dist-packages (from patsy->statsmodels)
Building wheels for collected packages: statsmodels
  Running setup.py bdist_wheel for statsmodels ... - \ | / - \ | / - \ | / - \ done
  Stored in directory: /home/notebook/.cache/pip/wheels/38/d3/1e/94a59b1460b3249b15399e09dae7a3828045bcf830d999b4b1
Successfully built statsmodels
Installing collected packages: patsy, statsmodels
Successfully installed patsy-0.4.1 statsmodels-0.6.1

Print the current version of Python:



In [4]:

    
import sys 
print(sys.version)









    



2.7.12 (default, Jul 18 2016, 15:02:52) 
[GCC 4.8.4]

Downloading Data

Run system commands using ! (platform dependent):



In [5]:

    
import sys
# List the working directory: 'dir' on Windows, 'ls' on Linux, FreeBSD and macOS
if sys.platform.startswith('win'):
    !dir
else:
    !ls









    



Basic_statistics_for_Python_3_6.ipynb  Submit-to-Spark-Cluster.ipynb
Basic_statistics.ipynb		       Tutorial #1 - Get Data.ipynb
common				       Untitled.ipynb
data				       US_Baby_Names-2010.ipynb
Getting_started_with_Pandas.ipynb      yob2010.txt
jupyter

To download the data, we will use !wget (on DataScientistWorkbench)



In [6]:

    
if sys.platform.startswith('linux'):
    !wget -O /resources/customer_dbase_sel.csv http://analytics.romanko.ca/data/customer_dbase_sel.csv









    



--2017-01-30 22:00:31--  http://analytics.romanko.ca/data/customer_dbase_sel.csv
Resolving analytics.romanko.ca (analytics.romanko.ca)... 50.116.83.209
Connecting to analytics.romanko.ca (analytics.romanko.ca)|50.116.83.209|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1116177 (1.1M) [text/csv]
Saving to: ‘/resources/customer_dbase_sel.csv’

100%[======================================>] 1,116,177   1.17MB/s   in 0.9s   

2017-01-30 22:00:32 (1.17 MB/s) - ‘/resources/customer_dbase_sel.csv’ saved [1116177/1116177]

Understanding the Data

`customer_dbase_sel.csv`:

We have downloaded an extract from an IBM SPSS sample dataset, customer_dbase_sel.csv, which contains customer-specific data such as age, income, credit card spending, and commute type and time.

custid e.g. 0648-AIPJSP-UVM (customer id)
gender e.g. Female or Male
age e.g. 26
debtinc e.g. 11.1 (debt to income ratio in %)
card e.g. Visa, Mastercard (type of primary credit card)
carditems e.g. 1, 2, 3 ... (# of primary credit card purchases in the last month)
cardspent e.g. 228.27 (amount in $ spent on the primary credit card last month)
commute e.g. Walk, Car, Bus (commute type)
commutetime e.g. 22 (time in minutes to commute to work)
annual_income e.g. 31000.0 (annual income in $)
ed_cat e.g. College degree, Post-undergraduate degree (education level)

Reading the data in



In [7]:

    
url = "http://analytics.romanko.ca/data/customer_dbase_sel.csv"
df = pd.read_csv(url)

## On DataScientistWorkbench you can read from /resources directory
#df = pd.read_csv("/resources/customer_dbase_sel.csv")

# display first 5 rows of the dataset
df.head()









    Out[7]:






  
    
      
            custid  gender  age  age_cat  debtinc        card  carditems  \
0  3964-QJWTRG-NPN  Female   20    18-24     11.1  Mastercard          5
1  0648-AIPJSP-UVM    Male   22    18-24     18.6        Visa          5
2  5195-TLUDJE-HVO  Female   67      >65      9.9        Visa          9
3  4459-VLPQUH-3OL    Male   23    18-24      5.7        Visa         17
4  8158-SMTQFB-CNO    Male   26    25-34      1.7    Discover          8

   cardspent  cardtype  creddebt  ...  carown  region              ed_cat  \
0      81.66      None      1.20  ...     Own  Zone 1        Some college
1      42.60     Other      1.22  ...     Own  Zone 5      College degree
2     184.22      None      0.93  ...     Own  Zone 3  High school degree
3     340.99      None      0.02  ...     Own  Zone 4        Some college
4     255.10      Gold      0.21  ...   Lease  Zone 2        Some college

   ed_years                      job_cat  employ_years       emp_cat  retire  \
0        15  Managerial and Professional             0   Less than 2      No
1        17             Sales and Office             0   Less than 2      No
2        14             Sales and Office            16  More than 15      No
3        16             Sales and Office             0   Less than 2      No
4        16             Sales and Office             1   Less than 2      No

   annual_income    inc_cat
0        31000.0  $25 - $49
1        15000.0  Under $25
2        35000.0  $25 - $49
3        20000.0  Under $25
4        23000.0  Under $25
    
  

5 rows × 30 columns

Data Exploration



In [8]:

    
# Summarize the data
df.describe()









    



/usr/local/lib/python2.7/dist-packages/numpy/lib/function_base.py:3834: RuntimeWarning: Invalid value encountered in percentile
  RuntimeWarning)






    Out[8]:






  
    
      
               age      debtinc   carditems    cardspent     creddebt  commutetime  \
count  5000.000000  5000.000000  5000.00000  5000.000000  5000.000000  4998.000000
mean     46.939800     9.957800    10.19920   339.635878     1.874982    25.346739
std      17.703312     6.423173     3.39279   248.382982     3.441425     5.890674
min      18.000000     0.000000     0.00000     0.000000     0.000000     7.000000
25%      32.000000     5.175000     8.00000   184.860000     0.390000          NaN
50%      46.000000     8.800000    10.00000   278.655000     0.930000          NaN
75%      62.000000    13.500000    12.00000   422.402500     2.080000          NaN
max      79.000000    43.100000    23.00000  3926.410000   109.070000    48.000000

        card2items   card2spent         cars     ed_years  employ_years  annual_income
count  5000.000000  5000.000000  5000.000000  5000.000000   5000.000000   5.000000e+03
mean      4.666000   161.331270     2.134200    14.537600      9.740200   5.504060e+04
std       2.482434   146.798035     1.306037     3.294717      9.691062   5.554475e+04
min       0.000000     0.000000     0.000000     6.000000      0.000000   9.000000e+03
25%       3.000000    67.682500     1.000000    12.000000      2.000000   2.400000e+04
50%       5.000000   125.455000     2.000000    14.000000      7.000000   3.800000e+04
75%       6.000000   208.612500     3.000000    17.000000     15.000000   6.700000e+04
max      15.000000  2069.250000     8.000000    23.000000     52.000000   1.073000e+06



In [9]:

    
# Number of rows and columns in the data
df.shape









    Out[9]:





(5000, 30)



In [10]:

    
# Display column names
df.columns









    Out[10]:





Index([u'custid', u'gender', u'age', u'age_cat', u'debtinc', u'card',
       u'carditems', u'cardspent', u'cardtype', u'creddebt', u'commute',
       u'commutetime', u'card2', u'card2items', u'card2spent', u'card2type',
       u'marital', u'homeown', u'hometype', u'cars', u'carown', u'region',
       u'ed_cat', u'ed_years', u'job_cat', u'employ_years', u'emp_cat',
       u'retire', u'annual_income', u'inc_cat'],
      dtype='object')

Labeling Data

income > 30000 --> High-income --> 1
income <= 30000 --> Low-income --> 0



In [11]:

    
# To label data into high-income and low-income
df['income_category'] = df['annual_income'].map(lambda x: 1 if x>30000 else 0)
df[['annual_income','income_category']].head()









    Out[11]:






  
    
      
   annual_income  income_category
0        31000.0                1
1        15000.0                0
2        35000.0                1
3        20000.0                0
4        23000.0                0
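
The same labeling rule can also be written as a vectorized comparison instead of a `lambda`. The sketch below is illustrative only: it is written for Python 3 with a current pandas, and the small `demo` frame is a hypothetical stand-in for `df`, chosen so that one income sits exactly on the 30000 threshold.

```python
import pandas as pd

# Hypothetical incomes, including one exactly at the threshold
demo = pd.DataFrame({"annual_income": [31000.0, 15000.0, 30000.0, 35000.0]})

# The comparison yields booleans; astype(int) turns True/False into 1/0
demo["income_category"] = (demo["annual_income"] > 30000).astype(int)
print(demo)
```

An income of exactly 30000 lands in the low-income group, because the comparison is strict.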

Data Exploration

Select 4 data columns for visualizing:



In [12]:

    
viz = df[['cardspent','debtinc','carditems','commutetime']]
viz.head()









    Out[12]:






  
    
      
   cardspent  debtinc  carditems  commutetime
0      81.66     11.1          5         22.0
1      42.60     18.6          5         29.0
2     184.22      9.9          9         24.0
3     340.99      5.7         17         38.0
4     255.10      1.7          8         32.0

Compute descriptive statistics for the data:



In [13]:

    
viz.describe()









    Out[13]:






  
    
      
         cardspent      debtinc   carditems  commutetime
count  5000.000000  5000.000000  5000.00000  4998.000000
mean    339.635878     9.957800    10.19920    25.346739
std     248.382982     6.423173     3.39279     5.890674
min       0.000000     0.000000     0.00000     7.000000
25%     184.860000     5.175000     8.00000          NaN
50%     278.655000     8.800000    10.00000          NaN
75%     422.402500    13.500000    12.00000          NaN
max    3926.410000    43.100000    23.00000    48.000000

Count the observations that remain after dropping NaN (Not-a-Number) values:



In [14]:

    
df[['commutetime']].dropna().count()









    Out[14]:





commutetime    4998
dtype: int64
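
As a small illustration of what `dropna()` does, the sketch below uses a hypothetical five-value Series rather than the customer data, and assumes Python 3 with a current pandas. It shows that `dropna()` returns a new object and leaves the original untouched, and that statistics computed on the cleaned values are well defined.

```python
import numpy as np
import pandas as pd

# Hypothetical commute times with one missing entry
s = pd.Series([22.0, 29.0, np.nan, 38.0, 32.0], name="commutetime")

n_missing = int(s.isna().sum())   # number of NaN entries
clean = s.dropna()                # new Series; s itself is unchanged
print(n_missing, len(s), len(clean), clean.median())
```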

Print observations with NaN commutetime:



In [15]:

    
print( df[np.isnan(df["commutetime"])] )









    



               custid  gender  age age_cat  debtinc      card  carditems  \
965   3622-JHDLVP-V1E  Female   48   35-49      6.5  Discover         12   
2734  0860-BRGALK-LLR  Female   68     >65     17.3     Other          8   

      cardspent  cardtype  creddebt       ...         region          ed_cat  \
965      261.91  Platinum      2.25       ...         Zone 1  College degree   
2734     178.75  Platinum      1.08       ...         Zone 5    Some college   

     ed_years                                job_cat  employ_years  \
965        19                                Service            12   
2734       15  Operation, Fabrication, General Labor            20   

           emp_cat retire annual_income     inc_cat  income_category  
965       11 to 15     No      121000.0  $75 - $124                1  
2734  More than 15    Yes       23000.0   Under $25                0  

[2 rows x 31 columns]

Visualize data:



In [16]:

    
viz.hist()
plt.show()



In [17]:

    
df[['cardspent']].hist()
plt.show()



In [18]:

    
df[['commutetime']].hist()
plt.show()

Confidence Intervals

For computing confidence intervals and performing simple statistical tests, we will use the stats sub-module of scipy:



In [19]:

    
from scipy import stats

A confidence interval gives a range of plausible values for the true (population) mean, at a stated level of confidence.

We compute the mean mu, the standard deviation sigma, and the number of observations N in our sample of the debt-to-income ratio:



In [20]:

    
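# np.std uses the population formula (ddof=0) here, so sigma is slightly
# smaller than the sample standard deviation reported by describe()
mu, sigma = np.mean(df[['debtinc']]), np.std(df[['debtinc']])
print ("mean = %g, st. dev = %g" % (mu, sigma))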









    



mean = 9.9578, st. dev = 6.42253



In [21]:

    
N = len(df[['debtinc']])
N









    Out[21]:





5000

The 95% confidence interval for the mean, based on N draws from a Normal distribution with mean mu and standard deviation sigma, is mu ± 1.96 × sigma / sqrt(N):



In [22]:

    
conf_int = stats.norm.interval( 0.95, loc = mu, scale = sigma/np.sqrt(N) )
conf_int









    Out[22]:





(array([ 9.7797798]), array([ 10.1358202]))



In [23]:

    
print ("95%% confidence interval for the mean of debt to income ratio = [%g %g]" % (conf_int[0], conf_int[1]))









    



95% confidence interval for the mean of debt to income ratio = [9.77978 10.1358]
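
To see where these bounds come from, the interval can be rebuilt by hand from the normal critical value and the standard error of the mean. This is a minimal sketch for Python 3; the values of `mu`, `sigma` and `n` are copied from the output printed earlier rather than recomputed from the data.

```python
import numpy as np
from scipy import stats

# Values printed earlier in the notebook
mu, sigma, n = 9.9578, 6.42253, 5000

z = stats.norm.ppf(0.975)          # two-sided 95% critical value, about 1.96
margin = z * sigma / np.sqrt(n)    # z times the standard error of the mean
lower, upper = mu - margin, mu + margin
print("95%% CI = [%g %g]" % (lower, upper))
```

Up to rounding of the inputs, the result agrees with the interval returned by `stats.norm.interval`.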

Statistical Tests

Select columns by name:



In [24]:

    
adf=df[['gender','cardspent','debtinc']]
print(adf['gender'])









    



0       Female
1         Male
2       Female
3         Male
4         Male
5         Male
6       Female
7       Female
8       Female
9         Male
10      Female
11      Female
12        Male
13        Male
14      Female
15      Female
16      Female
17        Male
18      Female
19      Female
20        Male
21        Male
22        Male
23        Male
24        Male
25      Female
26      Female
27      Female
28      Female
29        Male
         ...  
4970    Female
4971      Male
4972    Female
4973      Male
4974      Male
4975    Female
4976      Male
4977      Male
4978      Male
4979      Male
4980    Female
4981      Male
4982      Male
4983    Female
4984      Male
4985    Female
4986      Male
4987      Male
4988    Female
4989    Female
4990    Female
4991      Male
4992    Female
4993      Male
4994      Male
4995      Male
4996      Male
4997    Female
4998    Female
4999    Female
Name: gender, dtype: object

Compute means for cardspent and debtinc for the male and female populations:



In [25]:

    
gender_data = adf.groupby('gender')
print (gender_data.mean())









    



         cardspent   debtinc
gender                      
Female  323.343489  9.985221
Male    356.606840  9.929236

Compute mean for cardspent for female population only:



In [26]:

    
adf[adf['gender'] == 'Female']['cardspent'].mean()









    Out[26]:





323.34348882791062

We have seen above that the mean cardspent and debtinc differ between the male and female groups. To test whether these differences are statistically significant, we run a two-sample t-test with scipy.stats.ttest_ind():



In [27]:

    
female_card = adf[adf['gender'] == 'Female']['cardspent']
male_card = adf[adf['gender'] == 'Male']['cardspent']
tc, pc = stats.ttest_ind(female_card, male_card)
print ("t-test: t = %g  p = %g" % (tc, pc))









    



t-test: t = -4.74396  p = 2.15418e-06

In the case of amount spent on primary credit card, we conclude that men tend to charge more on their primary card (p-value = 2e-6 < 0.05, statistically significant).



In [28]:

    
female_debt = adf[adf['gender'] == 'Female']['debtinc']
male_debt   = adf[adf['gender'] == 'Male']['debtinc']
# Name the p-value p_debt: calling it pd would overwrite the pandas alias
td, p_debt  = stats.ttest_ind(female_debt, male_debt)
print ("t-test: t = %g  p = %g" % (td, p_debt))









    



t-test: t = 0.308069  p = 0.758043

In the case of debt-to-income ratio, we conclude that there is no significant difference between men and women (p-value = 0.758 > 0.05, not statistically significant).
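
For readers who want to see what `ttest_ind` computes with its default settings, the sketch below reproduces the pooled-variance (equal-variance) t statistic by hand. It assumes Python 3 and uses synthetic, randomly generated groups; `group_a` and `group_b` are hypothetical stand-ins, not the customer data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two groups (not the customer data)
rng = np.random.default_rng(0)
group_a = rng.normal(loc=320.0, scale=250.0, size=500)
group_b = rng.normal(loc=360.0, scale=250.0, size=500)

# Pooled variance: weighted average of the two sample variances
na, nb = len(group_a), len(group_b)
pooled_var = ((na - 1) * group_a.var(ddof=1)
              + (nb - 1) * group_b.var(ddof=1)) / (na + nb - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1.0 / na + 1.0 / nb))

# Two-sided p-value from the t distribution with na + nb - 2 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=na + nb - 2)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print("manual: t = %g  p = %g" % (t_manual, p_manual))
print("scipy:  t = %g  p = %g" % (t_scipy, p_scipy))
```

If the two groups may have unequal variances, passing `equal_var=False` to `ttest_ind` selects Welch's t-test instead.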

Plot Data

Plot statistical measures for amounts spent on primary credit card

Use `boxplot` to compare medians, the 25th and 75th percentiles (the box), and the whiskers, which by default extend to the most extreme observations within 1.5 × IQR of the box:



In [29]:

    
adf.boxplot(column='cardspent', by='gender', grid=False, showfliers=False)
plt.show()
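
The quantities a box plot draws can also be computed directly. The sketch below applies the default 1.5 × IQR whisker rule to a small hypothetical array (not the customer data) with NumPy only, assuming Python 3.

```python
import numpy as np

# Hypothetical spending values with one extreme observation
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points inside the 1.5 * IQR fences
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is a "flier", hidden by showfliers=False
fliers = values[(values < lower_fence) | (values > upper_fence)]
print(q1, median, q3, whisker_low, whisker_high, fliers)
```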

Plot observations with `boxplot`:



In [30]:

    
gend = ['Female', 'Male']
for i, g in enumerate(gend, start=1):
    y = adf.cardspent[adf.gender == g].dropna()
    # Add some random "jitter" to the x-axis so points do not overlap
    x = np.random.normal(i, 0.04, size=len(y))
    plt.plot(x, y, 'r.', alpha=0.2)
# Overlay the box plots at x positions 1 and 2
plt.boxplot([female_card, male_card], labels=gend)
plt.ylabel("cardspent")
plt.ylim((-50, 850))
plt.show()

Plot age vs. income data to find some interesting relationships.



In [31]:

    
plt.scatter(df.age, df.annual_income)
plt.xlabel("Age")
plt.ylabel("Income")
plt.show()
